Incremental Shared Nearest Neighbor Density-Based Clustering Algorithms for Dynamic Datasets
نویسنده
چکیده
Dynamic datasets undergo frequent changes where small number of data points are added and deleted. Such dynamic datasets are frequently encountered in many real world applications such as search engines and recommender systems. Incremental data mining algorithms process these updates to datasets efficiently to avoid redundant computation. Shared nearest neighbor density based clustering (SNN-DBSCAN) is a widely used clustering algorithm, mainly for its robustness. Existing incremental extension to SNNDBSCAN cannot handle deletions to dataset and handles insertions only point by point. We overcome both these bottlenecks by efficiently identifying affected parts of clusters while processing updates to dataset in batch mode. We present three different incremental algorithms with varying efficiency at elimination of redundant computation. We show effectiveness of our algorithms by performing experiments on large synthetic as well as real world datasets. Our algorithms are up to 2 Orders of Magnitude faster than nonincremental algorithm and up-to 50 times faster than existing incremental algorithm while guaranteeing exact same output.
منابع مشابه
Batch Incremental Shared Nearest Neighbor Density Based Clustering Algorithm for Dynamic Datasets
Incremental data mining algorithms process frequent updates to dynamic datasets efficiently by avoiding redundant computation. Existing incremental extension to shared nearest neighbor density based clustering (SNND) algorithm cannot handle deletions to dataset and handles insertions only one point at a time. We present an incremental algorithm to overcome both these bottlenecks by efficiently ...
متن کاملCoherent Gene Expression Pattern Finding Using Clustering Approaches
Analysis of gene expression data is an important research field in DNA microarray research. Data mining techniques have proven to be useful in understanding gene function, gene regulation, cellular processes and subtypes of cells. Most data mining algorithms developed for gene expression data deal with the problem of clustering. The purpose of this thesis is to study different clustering approa...
متن کاملA Survey Paper on Data Clustering using Incremental Affine Propagation
Clustering domain is vital part of data mining domain and widely used in different applications. In this project we are focusing on affinity propagation (AP) clustering which is presented recently to overcome many clustering problems in different clustering applications. Many clustering applications are based on static data. AP clustering approach is supporting only static data applications, he...
متن کاملSoftware Cost Estimation by a New Hybrid Model of Particle Swarm Optimization and K-Nearest Neighbor Algorithms
A successful software should be finalized with determined and predetermined cost and time. Software is a production which its approximate cost is expert workforce and professionals. The most important and approximate software cost estimation (SCE) is related to the trained workforce. Creative nature of software projects and its abstract nature make extremely cost and time of projects difficult ...
متن کاملAssessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016